R uses functions to perform operations. To run a function called funcname, we type funcname(input1, input2), where the inputs (or arguments) input1 argument and input2 tell R how to run the function. A function can have any number of inputs. For example, to create a vector of numbers, we use the function c() (for concatenate). Any numbers inside the parentheses are joined together. The following command instructs R to join together the numbers 1, 3, 2, and 5, and to save them as a vector named x. When we type x, it gives us back the vector.
x <- c(1, 3, 2, 5)
x
[1] 1 3 2 5
Note that the > is not part of the command; rather, it is printed by R to indicate that it is ready for another command to be entered. We can also save things using = rather than <-:
x = c(1, 6, 2)
x
[1] 1 6 2
y = c(1, 4, 3)
y
[1] 1 4 3
Hitting the up arrow multiple times will display the previous commands, which can then be edited. This is useful since one often wishes to repeat a similar command. In addition, typing ?funcname will always cause R to open a new help file window with additional information about the function funcname.
We can tell R to add two sets of numbers together. It will then add the first number from x to the first number from y, and so on. However, x and y should be the same length. We can check their length using the length() function.
length(x)
[1] 3
length(y)
[1] 3
x + y
[1] 2 10 5
The ls() function allows us to look at a list of all of the objects, such as data and functions, that we have saved so far. The rm() function can be used to delete any that we don’t want.
ls()
[1] "x" "y"
rm(x, y)
ls()
character(0)
It’s also possible to remove all objects at once:
rm(list = ls())
The matrix() function can be used to create a matrix of numbers. Before we use the matrix() function, we can learn more about it:
?matrix
The help file reveals that the matrix() function takes a number of inputs, but for now we focus on the first three: the data (the entries in the matrix), the number of rows, and the number of columns. First, we create a simple matrix.
x <- matrix(data = c(1, 2, 3, 4), nrow = 2, ncol = 2)
x
[,1] [,2]
[1,] 1 3
[2,] 2 4
Note that we could just as well omit typing data =, nrow =, and ncol = in the matrix() command above: that is, we could just type
x <- matrix(c(1, 2, 3, 4), 2, 2)
and this would have the same effect. However, it can sometimes be useful to specify the names of the arguments passed in, since otherwise R will assume that the function arguments are passed into the function in the same order that is given in the function’s help file. As this example illustrates, by default R creates matrices by successively filling in columns. Alternatively, the byrow = TRUE option can be used to populate the matrix in order of the rows.
matrix(c(1, 2, 3, 4), 2, 2, byrow = TRUE)
[,1] [,2]
[1,] 1 2
[2,] 3 4
Notice that in the above command we did not assign the matrix to a value
such as x. In this case the matrix is printed to the screen but is not saved
for future calculations. The sqrt() function returns the square root of each element of a vector or matrix. The command x^2 raises each element of x to the power 2; any powers are possible, including fractional or negative powers.
sqrt(x)
[,1] [,2]
[1,] 1.000000 1.732051
[2,] 1.414214 2.000000
x^2
[,1] [,2]
[1,] 1 9
[2,] 4 16
The rnorm() function generates a vector of random normal variables, with first argument n the sample size. Each time we call this function, we will get a different answer. Here we create two correlated sets of numbers, x and y, and use the cor() function to compute the correlation between them.
x = rnorm(50)
y = x + rnorm(50, mean = 50, sd = 0.1)
cor(x, y)
[1] 0.9938363
By default, rnorm() creates standard normal random variables with a mean of 0 and a standard deviation of 1. However, the mean and standard deviation can be altered using the mean and sd arguments, as illustrated above. Sometimes we want our code to reproduce the exact same set of random numbers; we can use the set.seed() function to do this. The set.seed() function takes an (arbitrary) integer argument.
set.seed(1303)
rnorm(50)
[1] -1.1439763145 1.3421293656 2.1853904757 0.5363925179 0.0631929665
[6] 0.5022344825 -0.0004167247 0.5658198405 -0.5725226890 -1.1102250073
[11] -0.0486871234 -0.6956562176 0.8289174803 0.2066528551 -0.2356745091
[16] -0.5563104914 -0.3647543571 0.8623550343 -0.6307715354 0.3136021252
[21] -0.9314953177 0.8238676185 0.5233707021 0.7069214120 0.4202043256
[26] -0.2690521547 -1.5103172999 -0.6902124766 -0.1434719524 -1.0135274099
[31] 1.5732737361 0.0127465055 0.8726470499 0.4220661905 -0.0188157917
[36] 2.6157489689 -0.6931401748 -0.2663217810 -0.7206364412 1.3677342065
[41] 0.2640073322 0.6321868074 -1.3306509858 0.0268888182 1.0406363208
[46] 1.3120237985 -0.0300020767 -0.2500257125 0.0234144857 1.6598706557
We use set.seed() throughout the labs whenever we perform calculations involving random quantities. In general this should allow the user to reproduce our results. However, it should be noted that as new versions of R become available it is possible that some small discrepancies may form between the book and the output from R.
The mean() and var() functions can be used to compute the mean and variance of a vector of numbers. Applying sqrt() to the output of var() will give the standard deviation. Or we can simply use the sd() function.
set.seed(3)
y <- rnorm(100)
mean(y)
[1] 0.01103557
var(y)
[1] 0.7328675
sqrt(var(y))
[1] 0.8560768
sd(y)
[1] 0.8560768
The plot() function is the primary way to plot data in R. For instance, plot(x, y) produces a scatterplot of the numbers in x versus the numbers in y. There are many additional options that can be passed in to the plot() function. For example, passing in the argument xlab will result in a label on the x-axis. To find out more information about the plot() function, type ?plot.
x <- rnorm(100)
y <- rnorm(100)
plot(x, y)
plot(x, y, xlab = "this is the x-axis", ylab = "this is the y-axis",
main = "Plot of X vs Y")
We will often want to save the output of an R plot. The command that we use to do this will depend on the file type that we would like to create. For instance, to create a jpeg, we use the jpeg() function, and to create a pdf, we use the pdf() function.
jpeg(file= "./JPG/YourFileName.jpeg")
plot(x, y, col = "green")
dev.off()
quartz_off_screen
2
To display the saved file as shown in Figure 1.1, use the include_graphics() function from knitr.2
Figure 1.1: Using knitr::include_graphics()
The function dev.off() indicates to R that we are done creating the plot. Alternatively, we can simply copy the plot window and paste it into an appropriate file type, such as a Word document.
The function seq() can be used to create a sequence of numbers. For instance, seq(a, b) makes a vector of integers between a and b. There are many other options: for instance, seq(0, 1, length = 10) makes a sequence of 10 numbers that are equally spaced between 0 and 1. Typing 3:11 is a shorthand for seq(3, 11) for integer arguments.
x <- seq(1, 10)
x
[1] 1 2 3 4 5 6 7 8 9 10
x <- 1:10
x
[1] 1 2 3 4 5 6 7 8 9 10
x = seq(-pi, pi, length = 50)
We will now create some more sophisticated plots. The contour() function produces a contour plot in order to represent three-dimensional data; it is like a topographical map. It takes three arguments:
As with the plot() function, there are many other inputs that can be used to fine-tune the output of the contour() function. To learn more about these, take a look at the help file by typing ?contour.
y <- x
f <- outer(x, y, function(x, y){
cos(y) / (1 + x^2)
})
contour(x, y, f)
contour(x, y, f, nlevels = 45, add = TRUE)
fa <- (f -t(f))/2
contour(x, y, fa, nlevels = 15)
The image() function works the same way as contour(), except that it produces a color-coded plot whose colors depend on the z value. This is known as a heatmap, and is sometimes used to plot temperature in weather forecasts. Alternatively, persp() can be used to produce a three-dimensional plot. The arguments theta and phi control the angles at which the plot is persp() viewed.
image(x, y, fa)
persp(x, y, fa)
persp(x, y, fa, theta = 30)
persp(x, y, fa, theta = 30, phi = 20)
persp(x, y, fa, theta = 30, phi = 70)
persp(x, y, fa, theta = 30, phi = 40)
We often wish to examine part of a set of data. Suppose that our data is stored in the matrix A.
A <- matrix(1:16, 4, 4)
A
[,1] [,2] [,3] [,4]
[1,] 1 5 9 13
[2,] 2 6 10 14
[3,] 3 7 11 15
[4,] 4 8 12 16
Then, typing
A[2, 3]
[1] 10
will select the element corresponding to the second row and the third column. The first number after the open-bracket symbol [ always refers to the row, and the second number always refers to the column. We can also select multiple rows and columns at a time, by providing vectors as the indices.
A[c(1, 3), c(2, 4)]
[,1] [,2]
[1,] 5 13
[2,] 7 15
A[1:3, 2:4]
[,1] [,2] [,3]
[1,] 5 9 13
[2,] 6 10 14
[3,] 7 11 15
A[1:2, ]
[,1] [,2] [,3] [,4]
[1,] 1 5 9 13
[2,] 2 6 10 14
A[, 1:2]
[,1] [,2]
[1,] 1 5
[2,] 2 6
[3,] 3 7
[4,] 4 8
The last two examples include either no index for the columns or no index for the rows. These indicate that R should include all columns or all rows, respectively. R treats a single row or column of a matrix as a vector.
A[1, ]
[1] 1 5 9 13
The use of a negative sign - in the index tells R to keep all rows or columns except those indicated in the index.
A[-c(1, 3), ]
[,1] [,2] [,3] [,4]
[1,] 2 6 10 14
[2,] 4 8 12 16
The dim() function outputs the number of rows followed by the number of columns of a given matrix.
dim(A)
[1] 4 4
For most analyses, the first step involves importing a data set into R. The read.table() function is one of the primary ways to do this. The help file contains details about how to use this function. We can use the function write.table() to export data. Before attempting to load a data set, we must make sure that R knows to search for the data in the proper directory. For example on a Windows system one could select the directory using the Change dir. . . option under the File menu. However, the details of how to do this depend on the operating system (e.g. Windows, Mac, Unix) that is being used, and so we do not give further details here. We begin by loading in the Auto data set. This data is part of the ISLR2 library, discussed in Chapter 3. To illustrate the read.table() function we load it now from a text file, Auto.data, which can find on the textbook website. The following command will load the Auto.data file into R and store it as an object called Auto, in a format referred to as a data frame.
Auto <- read.table("Auto.data")
head(Auto)
V1 V2 V3 V4 V5 V6 V7 V8
1 mpg cylinders displacement horsepower weight acceleration year origin
2 18.0 8 307.0 130.0 3504. 12.0 70 1
3 15.0 8 350.0 165.0 3693. 11.5 70 1
4 18.0 8 318.0 150.0 3436. 11.0 70 1
5 16.0 8 304.0 150.0 3433. 12.0 70 1
6 17.0 8 302.0 140.0 3449. 10.5 70 1
V9
1 name
2 chevrolet chevelle malibu
3 buick skylark 320
4 plymouth satellite
5 amc rebel sst
6 ford torino
Note that Auto.data is simply a text file, which you could alternatively open on your computer using a standard text editor. It is often a good idea to view a data set using a text editor or other software such as Excel before loading it into R. This particular data set has not been loaded correctly, because R has assumed that the variable names are part of the data and so has included them in the first row. The data set also includes a number of missing observations, indicated by a question mark ?. Missing values are a common occurrence in real data sets. Using the option header = TRUE in the read.table() function tells R that the first line of the file contains the variable names, and using the option na.strings tells R that any time it sees a particular character or set of characters (such as a question mark), it should be treated as a missing element of the data matrix.
Auto <- read.table("Auto.data", header = TRUE, sep = "", na.strings = "?", stringsAsFactors = TRUE)
head(Auto)
mpg cylinders displacement horsepower weight acceleration year origin
1 18 8 307 130 3504 12.0 70 1
2 15 8 350 165 3693 11.5 70 1
3 18 8 318 150 3436 11.0 70 1
4 16 8 304 150 3433 12.0 70 1
5 17 8 302 140 3449 10.5 70 1
6 15 8 429 198 4341 10.0 70 1
name
1 chevrolet chevelle malibu
2 buick skylark 320
3 plymouth satellite
4 amc rebel sst
5 ford torino
6 ford galaxie 500
library(DT)
datatable(Auto)
Auto[1:4, ]
mpg cylinders displacement horsepower weight acceleration year origin
1 18 8 307 130 3504 12.0 70 1
2 15 8 350 165 3693 11.5 70 1
3 18 8 318 150 3436 11.0 70 1
4 16 8 304 150 3433 12.0 70 1
name
1 chevrolet chevelle malibu
2 buick skylark 320
3 plymouth satellite
4 amc rebel sst
The dim() function tells us that the data has 397 observations, or rows, and 9 variables, or columns. There are various ways to deal with the missing data. In this case, only five of the rows contain missing observations, and so we choose to use the na.omit() function to simply remove these rows.
dim(Auto)
[1] 397 9
Auto2 <- na.omit(Auto)
dim(Auto2)
[1] 392 9
Once the data are loaded correctly, we can use names() to check the variable names.
names(Auto2)
[1] "mpg" "cylinders" "displacement" "horsepower" "weight"
[6] "acceleration" "year" "origin" "name"
We can use the plot() function to produce scatterplots of the quantitative variables. However, simply typing the variable names will produce an error message, because R does not know to look in the Auto data set for those variables. To refer to a variable, we must type the data set and the variable name joined with a $ symbol.
plot(Auto2$cylinders, Auto2$mpg)
plot(mpg ~ cylinders, data = Auto2)
with(data = Auto2,
plot(cylinders, mpg)
)
The cylinders variable is stored as a numeric vector, so R has treated it as quantitative. However, since there are only a small number of possible values for cylinders, one may prefer to treat it as a qualitative variable. The as.factor() function converts quantitative variables into qualitative variables.
Auto2$cylinders <- as.factor(Auto2$cylinders)
If the variable plotted on the x-axis is categorial, then boxplots will automatically be produced by the plot() function. As usual, a number of options can be specified in order to customize the plots.
plot(Auto2$cylinders, Auto2$mpg)
plot(mpg ~ cylinders, data = Auto2)
plot(mpg ~ cylinders, data = Auto2, col = "red")
plot(mpg ~ cylinders, data = Auto2, col = "red", varwidth = TRUE)
plot(mpg ~ cylinders, data = Auto2, col = "red", varwidth = TRUE,
horizontal = TRUE)
plot(mpg ~ cylinders, data = Auto2, col = "red", varwidth = TRUE,
horizontal = TRUE, xlab = "cylinders", ylab = "MPG")
The hist() function can be used to plot a histogram. Note that col = 2 has the same effect as col = "red".
hist(Auto2$mpg, col = "red", xlab = "MPG", main = "Your Title Here")
hist(Auto2$mpg, col = "red", xlab = "MPG", main = "Your Title Here", breaks = 15)
ggplot2See the geom_boxplot documentation and the geom_freqpoly documentation for more details.
library(ggplot2)
p <- ggplot(data = Auto2, aes(x = cylinders, y = mpg))
p + geom_boxplot()
p + geom_boxplot() +
coord_flip()
p + geom_boxplot() +
coord_flip() +
theme_bw()
p + geom_boxplot(fill = "red") +
coord_flip() +
theme_bw()
p + geom_boxplot(fill = "red") +
coord_flip() + theme_bw() +
labs(x = "Cylinders", y = "MPG")
p + geom_boxplot(fill = "red", varwidth = TRUE) +
coord_flip() +
theme_bw() +
labs(x = "Cylinders", y = "MPG")
p <- ggplot(data = Auto2, aes(x = mpg))
p + geom_histogram()
p + geom_histogram(binwidth = 5)
p + geom_histogram(binwidth = 5, fill = "blue")
p + geom_histogram(binwidth = 5, fill = "blue", color = "black")
p + geom_histogram(binwidth = 5, fill = "blue", color = "black") +
theme_bw()
p + geom_histogram(binwidth = 5, fill = "blue",
color = "black", aes(y = ..density..)) +
theme_bw()
ggvislibrary(ggvis)
Auto2 %>%
ggvis(x = ~cylinders, y = ~mpg) %>%
layer_boxplots(fill := "red")
Auto2 %>%
ggvis(x = ~mpg) %>%
layer_histograms(fill := "lightblue", width = 1)
Auto2 %>%
ggvis(x = ~mpg) %>%
layer_histograms(fill := "pink", width = 5) %>%
add_axis("x", title = "Miles Per Gallon")
plotlylibrary(plotly)
p1 <- ggplot(data = Auto2, aes(x = cylinders, y = mpg)) +
geom_boxplot(fill = "red", varwidth = TRUE) +
coord_flip() +
theme_bw() +
labs(x = "Cylinders", y = "MPG")
p2 <- ggplotly(p1)
p2
p3 <- ggplot(data = Auto2, aes(x = mpg)) +
geom_histogram(binwidth = 5, fill = "blue", color = "black") +
theme_bw()
p4 <- ggplotly(p3)
p4
The summary() function produces a numerical summary of each variable in a particular data set.
summary(Auto2)
mpg cylinders displacement horsepower weight
Min. : 9.00 3: 4 Min. : 68.0 Min. : 46.0 Min. :1613
1st Qu.:17.00 4:199 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
Median :22.75 5: 3 Median :151.0 Median : 93.5 Median :2804
Mean :23.45 6: 83 Mean :194.4 Mean :104.5 Mean :2978
3rd Qu.:29.00 8:103 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
Max. :46.60 Max. :455.0 Max. :230.0 Max. :5140
acceleration year origin name
Min. : 8.00 Min. :70.00 Min. :1.000 amc matador : 5
1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5
Median :15.50 Median :76.00 Median :1.000 toyota corolla : 5
Mean :15.54 Mean :75.98 Mean :1.577 amc gremlin : 4
3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4
Max. :24.80 Max. :82.00 Max. :3.000 chevrolet chevette: 4
(Other) :365
For qualitative variables such as name, R will list the number of observations that fall in each category. We can also produce a summary of just a single variable.
summary(Auto2$mpg)
Min. 1st Qu. Median Mean 3rd Qu. Max.
9.00 17.00 22.75 23.45 29.00 46.60
dplyrConsider producing summary statistics for the variable mpg when it is grouped by cylinders.
library(dplyr)
Auto2 %>%
group_by(cylinders) %>%
summarize(median(mpg), IQR(mpg), n())
# A tibble: 5 × 4
cylinders `median(mpg)` `IQR(mpg)` `n()`
<fct> <dbl> <dbl> <int>
1 3 20.2 3.3 4
2 4 28.4 7.95 199
3 5 25.4 8.05 3
4 6 19 3 83
5 8 14 3 103
The library() function is used to load libraries, or groups of functions and data sets that are not included in the base R distribution. Basic functions that perform least squares linear regression and other simple analyses come standard with the base distribution, but more exotic functions require additional libraries. Here we load the MASS package, which is a very large collection of data sets and functions. We also load the ISLR package, which includes the data sets associated with this book.
library(MASS)
library(ISLR)
library(DT)
If you receive an error message when loading any of these libraries, it likely indicates that the corresponding library has not yet been installed on your system. Some libraries, such as MASS, come with R and do not need to be separately installed on your computer. However, other packages, such as ISLR, must be downloaded the first time they are used. This can be done directly from within R. For example, on a Windows system, select the Install package option under the Packages tab. After you select any mirror site, a list of available packages will appear. Simply select the package you wish to install and R will automatically download the package. Alternatively, this can be done at the R command line via install.packages("ISLR"). This installation only needs to be done the first time you use a package. However, the library() function must be called each time you wish to use a given package.
The MASS library contains the Boston data set, which records medv (median house value) for 506 neighborhoods around Boston. We will seek to predict medv using 13 predictors such as rm (average number of rooms per house), age (average age of houses), and lstat (percent of households with low socioeconomic status).
names(Boston)
[1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
[8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
datatable(Boston, rownames = FALSE)
To find out more about the data set, we can type ?Boston.
We will start by using the lm() function to fit a simple linear regression model, with medv as the response and lstat as the predictor. The basic lm() syntax is lm(y∼x,data), where y is the response, x is the predictor, and
data is the data set in which these two variables are kept.
lm.fit <- lm(medv ~ lstat)
Error in eval(predvars, data, env): object 'medv' not found
The command causes an error because R does not know where to find the variables medv and lstat.
lm.fit <- lm(medv ~ lstat, data = Boston)
If we type lm.fit, some basic information about the model is output. For more detailed information, we use summary(lm.fit). This gives us p-values and standard errors for the coefficients, as well as the \(R^2\) statistic and F-statistic for the model.
lm.fit
Call:
lm(formula = medv ~ lstat, data = Boston)
Coefficients:
(Intercept) lstat
34.55 -0.95
summary(lm.fit)
Call:
lm(formula = medv ~ lstat, data = Boston)
Residuals:
Min 1Q Median 3Q Max
-15.168 -3.990 -1.318 2.034 24.500
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.55384 0.56263 61.41 <2e-16 ***
lstat -0.95005 0.03873 -24.53 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.216 on 504 degrees of freedom
Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
We can use the names() function in order to find out what other pieces of information are stored in lm.fit. Although we can extract these quantities by name—e.g. lm.fit$coefficients — it is safer to use the extractor functions like coef() to access them.
names(lm.fit)
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "xlevels" "call" "terms" "model"
lm.fit$coefficients
(Intercept) lstat
34.5538409 -0.9500494
lm.fit[[1]]
(Intercept) lstat
34.5538409 -0.9500494
coef(lm.fit)
(Intercept) lstat
34.5538409 -0.9500494
In order to obtain a confidence interval for the coefficient estimates, we can use the confint() command.
confint(lm.fit, level = 0.95)
2.5 % 97.5 %
(Intercept) 33.448457 35.6592247
lstat -1.026148 -0.8739505
Consider constructing a confidence interval for \(\beta_1\) using the information provided from the summary of lm.fit:
summary(lm.fit)$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.5538409 0.56262735 61.41515 3.743081e-236
lstat -0.9500494 0.03873342 -24.52790 5.081103e-88
b1 <- summary(lm.fit)$coefficients[2, 1]
sb1 <- summary(lm.fit)$coefficients[2, 2]
ct <- qt(0.975, summary(lm.fit)$df[2])
CI <- b1 + c(-1, 1)*ct*sb1
CI
[1] -1.0261482 -0.8739505
The predict() function can be used to produce confidence intervals and prediction intervals for the prediction of medv for a given value of lstat.
CI <- predict(object = lm.fit, newdata = data.frame(lstat = c(5, 10, 15)),
interval = "confidence")
CI
fit lwr upr
1 29.80359 29.00741 30.59978
2 25.05335 24.47413 25.63256
3 20.30310 19.73159 20.87461
PI <- predict(object = lm.fit, newdata = data.frame(lstat = c(5, 10, 15)),
interval = "predict")
PI
fit lwr upr
1 29.80359 17.565675 42.04151
2 25.05335 12.827626 37.27907
3 20.30310 8.077742 32.52846
For instance, the 95% confidence interval associated with a lstat value of 10 is (24.474132, 25.6325627), and the 95 % prediction interval is (12.8276263, 37.2790683). As expected, the confidence and prediction intervals are centered around the same point (a predicted value of 25.0533473 for medv when lstat equals 10), but the latter are substantially wider.
We will now plot medv and lstat along with the least squares regression line using the plot() and abline() functions.
plot(medv ~ lstat, data = Boston)
abline(lm.fit)
# Or using ggplot2
library(ggplot2)
ggplot(data = Boston, aes(x = lstat, y = medv)) +
geom_point() +
geom_smooth(method = "lm") +
theme_bw()
There is some evidence for non-linearity in the relationship between lstat and medv. We will explore this issue later in this lab.
The abline() function can be used to draw any line, not just the least squares regression line. To draw a line with intercept a and slope b, we type abline(a, b). Below we experiment with some additional settings for plotting lines and points. The lwd = 3 command causes the width of the regression line to be increased by a factor of 3; this works for the plot() and lines() functions also. We can also use the pch option to create different plotting symbols.
plot(medv ~ lstat, data = Boston)
abline(lm.fit, lwd = 3)
plot(medv ~ lstat, data = Boston)
abline(lm.fit, lwd = 3, col = "red")
plot(medv ~ lstat, data = Boston, pch = 20)
plot(medv ~ lstat, data = Boston, pch = "+")
plot(1:20, 1:20, pch = 1:20)
ggplot2ggplot(data = Boston, aes(x = lstat, y = medv)) +
geom_point() +
geom_smooth(method = "lm", color = "red") +
theme_bw()
# thicker line
ggplot(data = Boston, aes(x = lstat, y = medv)) +
geom_point() +
geom_smooth(method = "lm", color = "red", size = 2) +
theme_bw()
# Different plotting shapes
ggplot(data = Boston, aes(x = lstat, y = medv)) +
geom_point(shape = "+", size = 5) +
geom_smooth(method = "lm", color = "red") +
theme_bw()
ggplot(data = Boston, aes(x = lstat, y = medv)) +
geom_point(shape = 23) +
geom_smooth(method = "lm", color = "red") +
theme_bw()
Next we examine some diagnostic plots. Four diagnostic plots are automatically produced by applying the plot() function directly to the output from lm(). In general, this command will produce one plot at a time, and hitting Enter will generate the next plot. However, it is often convenient to view all four plots together. We can achieve this by using the par() function, which tells R to split the display screen into separate panels so that multiple plots can be viewed simultaneously. For example, par(mfrow = c(2, 2)) divides the plotting region into a 2 \(\times\) 2 grid of panels.
opar <- par(no.readonly = TRUE) # copy of current settings
par(mfrow = c(2, 2)) # 2 * 2
plot(lm.fit)
par(opar) # restore original settings
Alternatively, we can compute the residuals from a linear regression fit using the residuals() function. The function rstudent() will return the studentized residuals, and we can use this function to plot the residuals against the fitted values.
opar <- par(no.readonly = TRUE) # copy of current settings
par(mfrow = c(1, 2))
plot(predict(lm.fit), residuals(lm.fit))
plot(predict(lm.fit), rstudent(lm.fit))
par(opar) # restore original settings
I prefer to use the function residualPlots from the car package to evaluate residuals.
car::residualPlots(lm.fit)
Test stat Pr(>|Test stat|)
lstat 11.627 < 2.2e-16 ***
Tukey test 11.627 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
On the basis of the residual plots, there is some evidence of non-linearity. Leverage statistics can be computed for any number of predictors using the hatvalues function. The function influenceIndexPlot from the carpackage creates four diagnostic plots including a plot of the hav-values.
plot(hatvalues(lm.fit))
which.max(hatvalues(lm.fit))
375
375
car::influenceIndexPlot(lm.fit, id.n = 5)
The which.max() function identifies the index of the largest element of a vector. In this case, it tells us which observation has the largest leverage statistic.
Recall that hat-values exceeding \((p + 1)/n = (1 + 1)/504 = 0.0039683\) should be evaluated for high leverage.
In order to fit a multiple linear regression model using least squares, we again use the lm() function. The syntax lm(y ∼ x1 + x2 + x3) is used to fit a model with three predictors, x1, x2, and x3. The summary() function now outputs the regression coefficients for all the predictors.
ls.fit <- lm(medv ~ lstat + age, data = Boston)
summary(ls.fit)
Call:
lm(formula = medv ~ lstat + age, data = Boston)
Residuals:
Min 1Q Median 3Q Max
-15.981 -3.978 -1.283 1.968 23.158
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 33.22276 0.73085 45.458 < 2e-16 ***
lstat -1.03207 0.04819 -21.416 < 2e-16 ***
age 0.03454 0.01223 2.826 0.00491 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.173 on 503 degrees of freedom
Multiple R-squared: 0.5513, Adjusted R-squared: 0.5495
F-statistic: 309 on 2 and 503 DF, p-value: < 2.2e-16
The Boston data set contains 13 variables, and so it would be cumbersome to have to type all of these in order to perform a regression using all of the predictors. Instead, we can use the following short-hand:
ls.fit <- lm(medv ~ ., data = Boston)
summary(ls.fit)
Call:
lm(formula = medv ~ ., data = Boston)
Residuals:
Min 1Q Median 3Q Max
-15.595 -2.730 -0.518 1.777 26.199
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
crim -1.080e-01 3.286e-02 -3.287 0.001087 **
zn 4.642e-02 1.373e-02 3.382 0.000778 ***
indus 2.056e-02 6.150e-02 0.334 0.738288
chas 2.687e+00 8.616e-01 3.118 0.001925 **
nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
age 6.922e-04 1.321e-02 0.052 0.958229
dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
tax -1.233e-02 3.760e-03 -3.280 0.001112 **
ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
black 9.312e-03 2.686e-03 3.467 0.000573 ***
lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
We can access the individual components of a summary object by name (type ?summary.lm to see what is available). Hence summary(lm.fit)$r.sq gives us the \(R^2\), and summary(lm.fit)$sigma gives us the RSE. The vif() function, part of the car package, can be used to compute variance inflation factors. Most VIF’s are low to moderate for this data. The car package is not part of the base R installation so it must be downloaded the first time you use it via the install.packages option in R.
library(car)
vif(ls.fit)
crim zn indus chas nox rm age dis
1.792192 2.298758 3.991596 1.073995 4.393720 1.933744 3.100826 3.955945
rad tax ptratio black lstat
7.484496 9.008554 1.799084 1.348521 2.941491
What if we would like to perform a regression using all of the variables but one? For example, in the above regression output, age has a high p-value. So we may wish to run a regression excluding this predictor. The following syntax results in a regression using all predictors except age.
ls.fit1 <- lm(medv ~ . - age, data = Boston)
summary(ls.fit1)
Call:
lm(formula = medv ~ . - age, data = Boston)
Residuals:
Min 1Q Median 3Q Max
-15.6054 -2.7313 -0.5188 1.7601 26.2243
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.436927 5.080119 7.172 2.72e-12 ***
crim -0.108006 0.032832 -3.290 0.001075 **
zn 0.046334 0.013613 3.404 0.000719 ***
indus 0.020562 0.061433 0.335 0.737989
chas 2.689026 0.859598 3.128 0.001863 **
nox -17.713540 3.679308 -4.814 1.97e-06 ***
rm 3.814394 0.408480 9.338 < 2e-16 ***
dis -1.478612 0.190611 -7.757 5.03e-14 ***
rad 0.305786 0.066089 4.627 4.75e-06 ***
tax -0.012329 0.003755 -3.283 0.001099 **
ptratio -0.952211 0.130294 -7.308 1.10e-12 ***
black 0.009321 0.002678 3.481 0.000544 ***
lstat -0.523852 0.047625 -10.999 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.74 on 493 degrees of freedom
Multiple R-squared: 0.7406, Adjusted R-squared: 0.7343
F-statistic: 117.3 on 12 and 493 DF, p-value: < 2.2e-16
# Or
ls.fit2 <-update(ls.fit, .~. - age)
summary(ls.fit2)
Call:
lm(formula = medv ~ crim + zn + indus + chas + nox + rm + dis +
rad + tax + ptratio + black + lstat, data = Boston)
Residuals:
Min 1Q Median 3Q Max
-15.6054 -2.7313 -0.5188 1.7601 26.2243
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.436927 5.080119 7.172 2.72e-12 ***
crim -0.108006 0.032832 -3.290 0.001075 **
zn 0.046334 0.013613 3.404 0.000719 ***
indus 0.020562 0.061433 0.335 0.737989
chas 2.689026 0.859598 3.128 0.001863 **
nox -17.713540 3.679308 -4.814 1.97e-06 ***
rm 3.814394 0.408480 9.338 < 2e-16 ***
dis -1.478612 0.190611 -7.757 5.03e-14 ***
rad 0.305786 0.066089 4.627 4.75e-06 ***
tax -0.012329 0.003755 -3.283 0.001099 **
ptratio -0.952211 0.130294 -7.308 1.10e-12 ***
black 0.009321 0.002678 3.481 0.000544 ***
lstat -0.523852 0.047625 -10.999 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.74 on 493 degrees of freedom
Multiple R-squared: 0.7406, Adjusted R-squared: 0.7343
F-statistic: 117.3 on 12 and 493 DF, p-value: < 2.2e-16
It is easy to include interaction terms in a linear model using the lm() function. The syntax lstat:black tells R to include an interaction term between lstat and black. The syntax lstat*age simultaneously includes lstat, age, and the interaction term lstat\(\times\)age as predictors; it is a shorthand for lstat + age + lstat:age.
summary(lm(medv ~ lstat*age, data = Boston))
Call:
lm(formula = medv ~ lstat * age, data = Boston)
Residuals:
Min 1Q Median 3Q Max
-15.806 -4.045 -1.333 2.085 27.552
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.0885359 1.4698355 24.553 < 2e-16 ***
lstat -1.3921168 0.1674555 -8.313 8.78e-16 ***
age -0.0007209 0.0198792 -0.036 0.9711
lstat:age 0.0041560 0.0018518 2.244 0.0252 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.149 on 502 degrees of freedom
Multiple R-squared: 0.5557, Adjusted R-squared: 0.5531
F-statistic: 209.3 on 3 and 502 DF, p-value: < 2.2e-16
The lm() function can also accommodate non-linear transformations of the predictors. For instance, given a predictor \(X\), we can create a predictor \(X^2\) using I(X^2). The function I() is needed since the ^ has a special meaning
in a formula; wrapping as we do allows the standard usage in R, which is I() to raise \(X\) to the power 2. We now perform a regression of medv onto lstat and lstat\(^2\).
lm.fit2 <- lm(medv ~ lstat + I(lstat^2), data = Boston)
summary(lm.fit2)
Call:
lm(formula = medv ~ lstat + I(lstat^2), data = Boston)
Residuals:
Min 1Q Median 3Q Max
-15.2834 -3.8313 -0.5295 2.3095 25.4148
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 42.862007 0.872084 49.15 <2e-16 ***
lstat -2.332821 0.123803 -18.84 <2e-16 ***
I(lstat^2) 0.043547 0.003745 11.63 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.524 on 503 degrees of freedom
Multiple R-squared: 0.6407, Adjusted R-squared: 0.6393
F-statistic: 448.5 on 2 and 503 DF, p-value: < 2.2e-16
The near-zero p-value associated with the quadratic term suggests that it leads to an improved model. We use the anova() function to further quantify the extent to which the quadratic fit is superior to the linear fit.
anova(lm.fit, lm.fit2)
Analysis of Variance Table
Model 1: medv ~ lstat
Model 2: medv ~ lstat + I(lstat^2)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 504 19472
2 503 15347 1 4125.1 135.2 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here Model 1 (lm.fit) represents the linear submodel containing only one predictor, lstat, while Model 2 (lm.fit2) corresponds to the larger quadratic model that has two predictors, lstat and lstat\(^2\). The anova() function performs a hypothesis test comparing the two models. The null hypothesis is that the two models fit the data equally well, and the alternative hypothesis is that the full model is superior. Here the F-statistic is 135.1998221 and the associated p-value is virtually zero. This provides very clear evidence that the model containing the predictors lstat and lstat\(^2\) is far superior to the model that only contains the predictor lstat. This is not surprising, since earlier we saw evidence for non-linearity in the relationship between medv and lstat. If we type
par(mfrow = c(2, 2))
plot(lm.fit2)
par(mfrow = c(1, 1))
then we see that when the lstat\(^2\) term is included in the model, there is little discernible pattern in the residuals.
In order to create a cubic fit, we can include a predictor of the form I(X^3). However, this approach can start to get cumbersome for higher order polynomials. A better approach involves using the poly() function to create the polynomial within lm(). For example, the following command produces a fifth-order polynomial fit:
lm.fit5 <- lm(medv ~ poly(lstat, 5), data = Boston)
summary(lm.fit5)
Call:
lm(formula = medv ~ poly(lstat, 5), data = Boston)
Residuals:
Min 1Q Median 3Q Max
-13.5433 -3.1039 -0.7052 2.0844 27.1153
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.5328 0.2318 97.197 < 2e-16 ***
poly(lstat, 5)1 -152.4595 5.2148 -29.236 < 2e-16 ***
poly(lstat, 5)2 64.2272 5.2148 12.316 < 2e-16 ***
poly(lstat, 5)3 -27.0511 5.2148 -5.187 3.10e-07 ***
poly(lstat, 5)4 25.4517 5.2148 4.881 1.42e-06 ***
poly(lstat, 5)5 -19.2524 5.2148 -3.692 0.000247 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.215 on 500 degrees of freedom
Multiple R-squared: 0.6817, Adjusted R-squared: 0.6785
F-statistic: 214.2 on 5 and 500 DF, p-value: < 2.2e-16
This suggests that including additional polynomial terms, up to fifth order, leads to an improvement in the model fit! However, further investigation of the data reveals that no polynomial terms beyond fifth order have significant p-values in a regression fit.
library(ggplot2)
ggplot(data = Boston, aes(x = lstat, y = medv)) +
geom_point() +
theme_bw() +
stat_smooth(method = "lm", formula = y ~ poly(x, 5))
Figure 2.1: Fifth order polynomial fit
Of course, we are in no way restricted to using polynomial transformations of the predictors. Here we try a log transformation.
summary(lm(medv ~ log(rm), data = Boston))
Call:
lm(formula = medv ~ log(rm), data = Boston)
Residuals:
Min 1Q Median 3Q Max
-19.487 -2.875 -0.104 2.837 39.816
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -76.488 5.028 -15.21 <2e-16 ***
log(rm) 54.055 2.739 19.73 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.915 on 504 degrees of freedom
Multiple R-squared: 0.4358, Adjusted R-squared: 0.4347
F-statistic: 389.3 on 1 and 504 DF, p-value: < 2.2e-16
library(broom)
library(dplyr)
lm(medv ~ log(rm), data = Boston) %>%
tidy() %>%
knitr::kable(caption = "Summary of the coefficients created using pipes")
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -76.48782 | 5.027725 | -15.21321 | 0 |
| log(rm) | 54.05457 | 2.739460 | 19.73183 | 0 |
ggplot(data = Boston, aes(x = log(rm), y = medv)) +
geom_point() +
theme_bw() +
stat_smooth(method = "lm")
We will now examine the Carseats data, which is part of the ISLR package. We will attempt to predict Sales (child car seat sales) in 400 locations based on a number of predictors.
names(Carseats)
[1] "Sales" "CompPrice" "Income" "Advertising" "Population"
[6] "Price" "ShelveLoc" "Age" "Education" "Urban"
[11] "US"
The Carseats data includes qualitative predictors such as Shelveloc, an indicator of the quality of the shelving location—–that is, the space within a store in which the car seat is displayed—–at each location. The predictor Shelveloc takes on three possible values, Bad, Medium, and Good. Given a qualitative variable such as Shelveloc, R generates dummy variables automatically. Below we fit a multiple regression model that includes some interaction terms.
lm.fit <- lm(Sales ~ . + Income:Advertising + Price:Age, data = Carseats)
summary(lm.fit)
Call:
lm(formula = Sales ~ . + Income:Advertising + Price:Age, data = Carseats)
Residuals:
Min 1Q Median 3Q Max
-2.9208 -0.7503 0.0177 0.6754 3.3413
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.5755654 1.0087470 6.519 2.22e-10 ***
CompPrice 0.0929371 0.0041183 22.567 < 2e-16 ***
Income 0.0108940 0.0026044 4.183 3.57e-05 ***
Advertising 0.0702462 0.0226091 3.107 0.002030 **
Population 0.0001592 0.0003679 0.433 0.665330
Price -0.1008064 0.0074399 -13.549 < 2e-16 ***
ShelveLocGood 4.8486762 0.1528378 31.724 < 2e-16 ***
ShelveLocMedium 1.9532620 0.1257682 15.531 < 2e-16 ***
Age -0.0579466 0.0159506 -3.633 0.000318 ***
Education -0.0208525 0.0196131 -1.063 0.288361
UrbanYes 0.1401597 0.1124019 1.247 0.213171
USYes -0.1575571 0.1489234 -1.058 0.290729
Income:Advertising 0.0007510 0.0002784 2.698 0.007290 **
Price:Age 0.0001068 0.0001333 0.801 0.423812
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.011 on 386 degrees of freedom
Multiple R-squared: 0.8761, Adjusted R-squared: 0.8719
F-statistic: 210 on 13 and 386 DF, p-value: < 2.2e-16
The contrasts() function returns the coding that R uses for the dummy variables.
contrasts(Carseats$ShelveLoc)
Good Medium
Bad 0 0
Good 1 0
Medium 0 1
Use ?contrasts to learn about other contrasts, and how to set them.
R has created a ShelveLocGood dummy variable that takes on a value of 1 if the shelving location is good, and 0 otherwise. It has also created a ShelveLocMedium dummy variable that equals 1 if the shelving location is medium, and 0 otherwise. A bad shelving location corresponds to a zero for each of the two dummy variables. The fact that the coefficient for ShelveLocGood in the regression output is positive indicates that a good shelving location is associated with high sales (relative to a bad location). And ShelveLocMedium has a smaller positive coefficient, indicating that a medium shelving location leads to higher sales than a bad shelving location but lower sales than a good shelving location.
We will begin by examining some numerical and graphical summaries of the Smarket data, which is part of the ISLR2 library. This data set consists of percentage returns for the S&P 500 stock index over 1, 250 days, from the beginning of 2001 until the end of 2005. For each date, we have recorded the percentage returns for each of the five previous trading days, Lag1 through Lag5. We have also recorded Volume (the number of shares traded on the previous day, in billions), Today (the percentage return on the date in question) and Direction (whether the market was Up or Down on this date). Our goal is to predict Direction (a qualitative response) using the other features.
library(ISLR2)
names(Smarket)
[1] "Year" "Lag1" "Lag2" "Lag3" "Lag4" "Lag5"
[7] "Volume" "Today" "Direction"
dim(Smarket)
[1] 1250 9
summary(Smarket)
Year Lag1 Lag2 Lag3
Min. :2001 Min. :-4.922000 Min. :-4.922000 Min. :-4.922000
1st Qu.:2002 1st Qu.:-0.639500 1st Qu.:-0.639500 1st Qu.:-0.640000
Median :2003 Median : 0.039000 Median : 0.039000 Median : 0.038500
Mean :2003 Mean : 0.003834 Mean : 0.003919 Mean : 0.001716
3rd Qu.:2004 3rd Qu.: 0.596750 3rd Qu.: 0.596750 3rd Qu.: 0.596750
Max. :2005 Max. : 5.733000 Max. : 5.733000 Max. : 5.733000
Lag4 Lag5 Volume Today
Min. :-4.922000 Min. :-4.92200 Min. :0.3561 Min. :-4.922000
1st Qu.:-0.640000 1st Qu.:-0.64000 1st Qu.:1.2574 1st Qu.:-0.639500
Median : 0.038500 Median : 0.03850 Median :1.4229 Median : 0.038500
Mean : 0.001636 Mean : 0.00561 Mean :1.4783 Mean : 0.003138
3rd Qu.: 0.596750 3rd Qu.: 0.59700 3rd Qu.:1.6417 3rd Qu.: 0.596750
Max. : 5.733000 Max. : 5.73300 Max. :3.1525 Max. : 5.733000
Direction
Down:602
Up :648
pairs(Smarket)
The cor() function produces a matrix that contains all of the pairwise correlations among the predictors in a data set. The first command below gives an error message because the Direction variable is qualitative.
cor(Smarket)
#Error in cor(Smarket) : ‘x’ must be numeric
cor(Smarket[, -9])
Year Lag1 Lag2 Lag3 Lag4
Year 1.00000000 0.029699649 0.030596422 0.033194581 0.035688718
Lag1 0.02969965 1.000000000 -0.026294328 -0.010803402 -0.002985911
Lag2 0.03059642 -0.026294328 1.000000000 -0.025896670 -0.010853533
Lag3 0.03319458 -0.010803402 -0.025896670 1.000000000 -0.024051036
Lag4 0.03568872 -0.002985911 -0.010853533 -0.024051036 1.000000000
Lag5 0.02978799 -0.005674606 -0.003557949 -0.018808338 -0.027083641
Volume 0.53900647 0.040909908 -0.043383215 -0.041823686 -0.048414246
Today 0.03009523 -0.026155045 -0.010250033 -0.002447647 -0.006899527
Lag5 Volume Today
Year 0.029787995 0.53900647 0.030095229
Lag1 -0.005674606 0.04090991 -0.026155045
Lag2 -0.003557949 -0.04338321 -0.010250033
Lag3 -0.018808338 -0.04182369 -0.002447647
Lag4 -0.027083641 -0.04841425 -0.006899527
Lag5 1.000000000 -0.02200231 -0.034860083
Volume -0.022002315 1.00000000 0.014591823
Today -0.034860083 0.01459182 1.000000000
As one would expect, the correlations between the lag variables and today’s returns are close to zero. In other words, there appears to be little correlation between today’s returns and previous days’ returns. The only substantial correlation is between Year and Volume. By plotting the data, which is ordered chronologically, we see that Volume is increasing over time. In other words, the average number of shares traded daily increased from 2001 to 2005.
attach(Smarket)
plot(Volume)
Next, we will fit a logistic regression model in order to predict Direction using Lag1 through Lag5 and Volume. The glm() function can be used to fit many types of generalized linear models, including logistic regression. The syntax of the glm() function is similar to that of lm(), except that we must pass in the argument family = binomial in order to tell R to run a logistic regression rather than some other type of generalized linear model.
glm.fits <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,data = Smarket, family = binomial)
summary(glm.fits)
Call:
glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
Volume, family = binomial, data = Smarket)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.126000 0.240736 -0.523 0.601
Lag1 -0.073074 0.050167 -1.457 0.145
Lag2 -0.042301 0.050086 -0.845 0.398
Lag3 0.011085 0.049939 0.222 0.824
Lag4 0.009359 0.049974 0.187 0.851
Lag5 0.010313 0.049511 0.208 0.835
Volume 0.135441 0.158360 0.855 0.392
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1731.2 on 1249 degrees of freedom
Residual deviance: 1727.6 on 1243 degrees of freedom
AIC: 1741.6
Number of Fisher Scoring iterations: 3
The smallest p-value here is associated with Lag1. The negative coefficient for this predictor suggests that if the market had a positive return yesterday, then it is less likely to go up today. However, at a value of 0.15, the p-value is still relatively large, and so there is no clear evidence of a real association between Lag1 and Direction.
We use the coef() function in order to access just the coefficients for this fitted model. We can also use the summary() function to access particular aspects of the fitted model, such as the p-values for the coefficients.
coef(glm.fits)
(Intercept) Lag1 Lag2 Lag3 Lag4 Lag5
-0.126000257 -0.073073746 -0.042301344 0.011085108 0.009358938 0.010313068
Volume
0.135440659
summary(glm.fits)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.126000257 0.24073574 -0.5233966 0.6006983
Lag1 -0.073073746 0.05016739 -1.4565986 0.1452272
Lag2 -0.042301344 0.05008605 -0.8445733 0.3983491
Lag3 0.011085108 0.04993854 0.2219750 0.8243333
Lag4 0.009358938 0.04997413 0.1872757 0.8514445
Lag5 0.010313068 0.04951146 0.2082966 0.8349974
Volume 0.135440659 0.15835970 0.8552723 0.3924004
summary(glm.fits)$coef[, 4]
(Intercept) Lag1 Lag2 Lag3 Lag4 Lag5
0.6006983 0.1452272 0.3983491 0.8243333 0.8514445 0.8349974
Volume
0.3924004
The predict() function can be used to predict the probability that the market will go up, given values of the predictors. The type = "response" option tells R to output probabilities of the form \(P(Y = 1|X)\), as opposed to other information such as the logit. If no data set is supplied to the predict() function, then the probabilities are computed for the training data that was used to fit the logistic regression model. Here we have printed only the first ten probabilities. We know that these values correspond to the probability of the market going up, rather than down, because the contrasts() function indicates that R has created a dummy variable with a 1 for Up.
glm.probs <- predict(glm.fits , type = "response")
glm.probs[1:10]
1 2 3 4 5 6 7 8
0.5070841 0.4814679 0.4811388 0.5152224 0.5107812 0.5069565 0.4926509 0.5092292
9 10
0.5176135 0.4888378
contrasts(Direction)
Up
Down 0
Up 1
In order to make a prediction as to whether the market will go up or down on a particular day, we must convert these predicted probabilities into class labels, Up or Down. The following two commands create a vector of class predictions based on whether the predicted probability of a market increase is greater than or less than 0.5.
glm.pred <- rep("Down", 1250)
glm.pred[glm.probs > .5] = "Up"
The first command creates a vector of 1,250 Down elements. The second line transforms to Up all of the elements for which the predicted probability of a market increase exceeds 0.5. Given these predictions, the table() function can be used to produce a confusion matrix in order to determine how many observations were correctly or incorrectly classified.
table(glm.pred , Direction)
Direction
glm.pred Down Up
Down 145 141
Up 457 507
(507 + 145) / 1250
[1] 0.5216
mean(glm.pred == Direction)
[1] 0.5216
The diagonal elements of the confusion matrix indicate correct predictions, while the off-diagonals represent incorrect predictions. Hence our model correctly predicted that the market would go up on 507 days and that it would go down on 145 days, for a total of 507 + 145 = 652 correct predictions. The mean() function can be used to compute the fraction of days for which the prediction was correct. In this case, logistic regression correctly predicted the movement of the market \(52.2\%\) of the time.
At first glance, it appears that the logistic regression model is working a little better than random guessing. However, this result is misleading because we trained and tested the model on the same set of 1,250 observations. In other words, \(100\% - 52.2\% = 47.8\%\), is the training error rate. As we have seen previously, the training error rate is often overly optimistic- it tends to underestimate the test error rate. In order to better assess the accuracy of the logistic regression model in this setting, we can fit the model using part of the data, and then examine how well it predicts the held out data. This will yield a more realistic error rate, in the sense that in practice we will be interested in our model’s performance not on the data that we used to fit the model, but rather on days in the future for which the market’s movements are unknown.
To implement this strategy, we will first create a vector corresponding to the observations from 2001 through 2004. We will then use this vector to create a held out data set of observations from 2005.
train <- (Year < 2005)
Smarket.2005 <- Smarket[!train, ]
dim(Smarket.2005)
[1] 252 9
Direction.2005 <- Direction[!train]
The object train is a vector of 1,250 elements, corresponding to the observations in our data set. The elements of the vector that correspond to observations that occurred before 2005 are set to TRUE, whereas those that correspond to observations in 2005 are set to FALSE. The object train is a Boolean vector, since its elements are TRUE and FALSE. Boolean vectors can be used to obtain a subset of the rows or columns of a matrix. For instance, the command Smarket[train, ] would pick out a submatrix of the stock market data set, corresponding only to the dates before 2005, since those are the ones for which the elements of train are TRUE. The ! symbol can be used to reverse all of the elements of a Boolean vector. That is, !train is a vector similar to train, except that the elements that are TRUE in train get swapped to FALSE in !train, and the elements that are FALSE in train get swapped to TRUE in !train. Therefore, Smarket[!train, ] yields a submatrix of the stock market data containing only the observations for which train is FALSE- that is, the observations with dates in 2005. The output above indicates that there are 252 such observations.
We now fit a logistic regression model using only the subset of the observations that correspond to dates before 2005, using the subset argument. We then obtain predicted probabilities of the stock market going up for each of the days in our test set- that is, for the days in 2005.
glm.fits <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume , data = Smarket , family = binomial , subset = train)
glm.probs <- predict(glm.fits , Smarket.2005, type = "response")
Notice that we have trained and tested our model on two completely separate data sets: training was performed using only the dates before 2005, and testing was performed using only the dates in 2005. Finally, we compute the predictions for 2005 and compare them to the actual movements of the market over that time period.
glm.pred <- rep("Down", 252)
glm.pred[glm.probs > .5] <- "Up"
table(glm.pred , Direction.2005)
Direction.2005
glm.pred Down Up
Down 77 97
Up 34 44
mean(glm.pred == Direction.2005)
[1] 0.4801587
mean(glm.pred != Direction.2005)
[1] 0.5198413
The != notation means not equal to, and so the last command computes the test set error rate. The results are rather disappointing: the test error rate is \(52\%\), which is worse than random guessing! Of course this result is not all that surprising, given that one would not generally expect to be able to use previous days’ returns to predict future market performance. (After all, if it were possible to do so, then the authors of this book would be out striking it rich rather than writing a statistics textbook.)
We recall that the logistic regression model had very underwhelming pvalues associated with all of the predictors, and that the smallest p-value, though not very small, corresponded to Lag1. Perhaps by removing the variables that appear not to be helpful in predicting Direction, we can
obtain a more effective model. After all, using predictors that have no relationship with the response tends to cause a deterioration in the test error rate (since such predictors cause an increase in variance without a corresponding decrease in bias), and so removing such predictors may in turn yield an improvement. Below we have refit the logistic regression using just Lag1 and Lag2, which seemed to have the highest predictive power in the original logistic regression model.
glm.fits <- glm(Direction ~ Lag1 + Lag2 , data = Smarket , family = binomial , subset = train)
glm.probs <- predict(glm.fits , Smarket.2005,type = "response")
glm.pred <- rep("Down", 252)
glm.pred[glm.probs > .5] <- "Up"
table(glm.pred , Direction.2005)
Direction.2005
glm.pred Down Up
Down 35 35
Up 76 106
mean(glm.pred == Direction.2005)
[1] 0.5595238
106 / (106 + 76)
[1] 0.5824176
Now the results appear to be a little better: \(56\%\) of the daily movements have been correctly predicted. It is worth noting that in this case, a much simpler strategy of predicting that the market will increase every day will also be correct \(56\%\) of the time! Hence, in terms of overall error rate, the logistic regression method is no better than the naive approach. However, the confusion matrix shows that on days when logistic regression predicts an increase in the market, it has a \(58\%\) accuracy rate. This suggests a possible trading strategy of buying on days when the model predicts an increasing market, and avoiding trades on days when a decrease is predicted. Of course one would need to investigate more carefully whether this small improvement was real or just due to random chance.
Suppose that we want to predict the returns associated with particular values of Lag1 and Lag2. In particular, we want to predict Direction on a day when Lag1 and Lag2 equal 1.2 and 1.1, respectively, and on a day when they equal 1.5 and -0.8. We do this using the predict() function.
predict(glm.fits, newdata = data.frame(Lag1 = c(1.2 , 1.5) , Lag2 = c(1.1 , -0.8)),type = "response")
1 2
0.4791462 0.4960939
We will now perform KNN using the knn() function, which is part of the class library. This function works rather differently from the other modelfitting functions that we have encountered thus far. Rather than a two-step approach in which we first fit the model and then we use the model to make predictions, knn() forms predictions using a single command. The function requires four inputs.
A matrix containing the predictors associated with the training data, labeled train.X below.
A matrix containing the predictors associated with the data for which we wish to make predictions, labeled test.X below.
A vector containing the class labels for the training observations, labeled train.Direction below.
A value for \(K\), the number of nearest neighbors to be used by the classifier.
We use the cbind() function, short for column bind, to bind the Lag1 and Lag2 variables together into two matrices, one for the training set and the other for the test set.
library(class)
train.X <- cbind(Lag1 , Lag2)[train , ]
test.X <- cbind(Lag1 , Lag2)[!train , ]
train.Direction <- Direction[train]
Now the knn() function can be used to predict the market’s movement for the dates in 2005. We set a random seed before we apply knn() because if several observations are tied as nearest neighbors, then R will randomly break the tie. Therefore, a seed must be set in order to ensure reproducibility of results.
set.seed (1)
knn.pred <- knn(train.X, test.X, train.Direction , k = 1)
table(knn.pred , Direction.2005)
Direction.2005
knn.pred Down Up
Down 43 58
Up 68 83
(83 + 43) / 252
[1] 0.5
The results using \(K = 1\) are not very good, since only \(50\%\) of the observations are correctly predicted. Of course, it may be that \(K = 1\) results in an overly flexible fit to the data. Below, we repeat the analysis using \(K = 3\).
knn.pred <- knn(train.X, test.X, train.Direction , k = 3)
table(knn.pred , Direction.2005)
Direction.2005
knn.pred Down Up
Down 48 54
Up 63 87
mean(knn.pred == Direction.2005)
[1] 0.5357143
The results have improved slightly. But increasing \(K\) further turns out to provide no further improvements.
KNN does not perform well on the Smarket data but it does often provide impressive results. As an example we will apply the KNN approach to the Caravan data set, which is part of the ISLR2 library. This data set includes 85 predictors that measure demographic characteristics for 5,822 individuals. The response variable is Purchase, which indicates whether or not a given individual purchases a caravan insurance policy. In this data set, only \(6\%\) of people purchased caravan insurance.
dim(Caravan)
[1] 5822 86
attach(Caravan)
summary(Purchase)
No Yes
5474 348
348/5822
[1] 0.05977327
Because the KNN classifier predicts the class of a given test observation by identifying the observations that are nearest to it, the scale of the variables matters. Variables that are on a large scale will have a much larger effect on the distance between the observations, and hence on the KNN classifier, than variables that are on a small scale. For instance, imagine a data set that contains two variables, salary and age (measured in dollars and years, respectively). As far as KNN is concerned, a difference of \(\$1,000\) in salary is enormous compared to a difference of 50 years in age. Consequently, salary will drive the KNN classification results, and age will have almost no effect. This is contrary to our intuition that a salary difference of \(\$1,000\) is quite small compared to an age difference of 50 years. Furthermore, the importance of scale to the KNN classifier leads to another issue: if we measured salary in Japanese yen, or if we measured age in minutes, then we’d get quite different classification results from what we get if these two variables are measured in dollars and years.
A good way to handle this problem is to standardize the data so that all variables are given a mean of zero and a standard deviation of one. Then all variables will be on a comparable scale. The scale() function does just this. In standardizing the data, we exclude column 86, because that is the qualitative Purchase variable.
standardized.X <- scale(Caravan[, -86])
var(Caravan[, 1])
[1] 165.0378
var(Caravan[, 2])
[1] 0.1647078
var(standardized.X[, 1])
[1] 1
var(standardized.X[, 2])
[1] 1
Now every column of standardized.X has a standard deviation of one and a mean of zero.
We now split the observations into a test set, containing the first 1,000 observations, and a training set, containing the remaining observations. We fit a KNN model on the training data using \(K = 1\), and evaluate its performance on the test data.
test <- 1:1000
train.X <- standardized.X[-test, ]
test.X <- standardized.X[test, ]
train.Y <- Purchase[-test]
test.Y <- Purchase[test]
set.seed (1)
knn.pred <- knn(train.X, test.X, train.Y, k = 1)
mean(test.Y != knn.pred)
[1] 0.118
mean(test.Y != "No")
[1] 0.059
The vector test is numeric, with values from 1 through 1, 000. Typing standardized.X[test, ] yields the submatrix of the data containing the observations whose indices range from 1 to 1, 000, whereas typing standardized.X[-test, ] yields the submatrix containing the observations
whose indices do not range from 1 to 1, 000. The KNN error rate on the 1,000 test observations is just under \(12\%\). At first glance, this may appear to be fairly good. However, since only \(6\%\) of customers purchased insurance, we could get the error rate down to \(6\%\) by always predicting No regardless of the values of the predictors!
Suppose that there is some non-trivial cost to trying to sell insurance to a given individual. For instance, perhaps a salesperson must visit each potential customer. If the company tries to sell insurance to a random selection of customers, then the success rate will be only \(6\%\), which may be far too low given the costs involved. Instead, the company would like to try to sell insurance only to customers who are likely to buy it. So the overall error rate is not of interest. Instead, the fraction of individuals that are correctly predicted to buy insurance is of interest.
It turns out that KNN with \(K = 1\) does far better than random guessing among the customers that are predicted to buy insurance. Among 77 such customers, 9, or \(11.7\%\), actually do purchase insurance. This is double the rate that one would obtain from random guessing.
table(knn.pred , test.Y)
test.Y
knn.pred No Yes
No 873 50
Yes 68 9
9 / (68 + 9)
[1] 0.1168831
Using \(K = 3\), the success rate increases to \(19\%\), and with \(K = 5\) the rate is \(26.7\%\). This is over four times the rate that results from random guessing. It appears that KNN is finding some real patterns in a difficult data set!
knn.pred <- knn(train.X, test.X, train.Y, k = 3)
table(knn.pred , test.Y)
test.Y
knn.pred No Yes
No 920 54
Yes 21 5
5 / 26
[1] 0.1923077
knn.pred <- knn(train.X, test.X, train.Y, k = 5)
table(knn.pred , test.Y)
test.Y
knn.pred No Yes
No 930 55
Yes 11 4
4 / 15
[1] 0.2666667
However, while this strategy is cost-effective, it is worth noting that only 15 customers are predicted to purchase insurance using KNN with \(K = 5\). In practice, the insurance company may wish to expend resources on convincing more than just 15 potential customers to buy insurance.
As a comparison, we can also fit a logistic regression model to the data. If we use 0.5 as the predicted probability cut-off for the classifier, then we have a problem: only seven of the test observations are predicted to purchase insurance. Even worse, we are wrong about all of these! However, we are not required to use a cut-off of 0.5. If we instead predict a purchase any time the predicted probability of purchase exceeds 0.25, we get much better results: we predict that 33 people will purchase insurance, and we are correct for about \(33\%\) of these people. This is over five times better than random guessing!
glm.fits <- glm(Purchase ~ ., data = Caravan ,family = binomial , subset = -test)
glm.probs <- predict(glm.fits, Caravan[test , ],type = "response")
glm.pred <- rep("No", 1000)
glm.pred[glm.probs > .5] <- "Yes"
table(glm.pred , test.Y)
test.Y
glm.pred No Yes
No 934 59
Yes 7 0
glm.pred <- rep("No", 1000)
glm.pred[glm.probs > .25] <- "Yes"
table(glm.pred , test.Y)
test.Y
glm.pred No Yes
No 919 48
Yes 22 11
11 / (22+11)
[1] 0.3333333
Finally, we fit a Poisson regression model to the Bikeshare data set, which measures the number of bike rentals (bikers) per hour in Washington, DC. The data can be found in the ISLR2 library.
attach(Bikeshare)
dim(Bikeshare)
[1] 8645 15
names(Bikeshare)
[1] "season" "mnth" "day" "hr" "holiday"
[6] "weekday" "workingday" "weathersit" "temp" "atemp"
[11] "hum" "windspeed" "casual" "registered" "bikers"
We begin by fitting a least squares linear regression model to the data.
mod.lm <- lm( bikers ~ mnth + hr + workingday + temp + weathersit , data = Bikeshare)
summary(mod.lm)
Call:
lm(formula = bikers ~ mnth + hr + workingday + temp + weathersit,
data = Bikeshare)
Residuals:
Min 1Q Median 3Q Max
-299.00 -45.70 -6.23 41.08 425.29
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -68.632 5.307 -12.932 < 2e-16 ***
mnthFeb 6.845 4.287 1.597 0.110398
mnthMarch 16.551 4.301 3.848 0.000120 ***
mnthApril 41.425 4.972 8.331 < 2e-16 ***
mnthMay 72.557 5.641 12.862 < 2e-16 ***
mnthJune 67.819 6.544 10.364 < 2e-16 ***
mnthJuly 45.324 7.081 6.401 1.63e-10 ***
mnthAug 53.243 6.640 8.019 1.21e-15 ***
mnthSept 66.678 5.925 11.254 < 2e-16 ***
mnthOct 75.834 4.950 15.319 < 2e-16 ***
mnthNov 60.310 4.610 13.083 < 2e-16 ***
mnthDec 46.458 4.271 10.878 < 2e-16 ***
hr1 -14.579 5.699 -2.558 0.010536 *
hr2 -21.579 5.733 -3.764 0.000168 ***
hr3 -31.141 5.778 -5.389 7.26e-08 ***
hr4 -36.908 5.802 -6.361 2.11e-10 ***
hr5 -24.135 5.737 -4.207 2.61e-05 ***
hr6 20.600 5.704 3.612 0.000306 ***
hr7 120.093 5.693 21.095 < 2e-16 ***
hr8 223.662 5.690 39.310 < 2e-16 ***
hr9 120.582 5.693 21.182 < 2e-16 ***
hr10 83.801 5.705 14.689 < 2e-16 ***
hr11 105.423 5.722 18.424 < 2e-16 ***
hr12 137.284 5.740 23.916 < 2e-16 ***
hr13 136.036 5.760 23.617 < 2e-16 ***
hr14 126.636 5.776 21.923 < 2e-16 ***
hr15 132.087 5.780 22.852 < 2e-16 ***
hr16 178.521 5.772 30.927 < 2e-16 ***
hr17 296.267 5.749 51.537 < 2e-16 ***
hr18 269.441 5.736 46.976 < 2e-16 ***
hr19 186.256 5.714 32.596 < 2e-16 ***
hr20 125.549 5.704 22.012 < 2e-16 ***
hr21 87.554 5.693 15.378 < 2e-16 ***
hr22 59.123 5.689 10.392 < 2e-16 ***
hr23 26.838 5.688 4.719 2.41e-06 ***
workingday 1.270 1.784 0.711 0.476810
temp 157.209 10.261 15.321 < 2e-16 ***
weathersitcloudy/misty -12.890 1.964 -6.562 5.60e-11 ***
weathersitlight rain/snow -66.494 2.965 -22.425 < 2e-16 ***
weathersitheavy rain/snow -109.745 76.667 -1.431 0.152341
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 76.5 on 8605 degrees of freedom
Multiple R-squared: 0.6745, Adjusted R-squared: 0.6731
F-statistic: 457.3 on 39 and 8605 DF, p-value: < 2.2e-16
In mod.lm, the first level of hr (0) and mnth (Jan) are treated as the baseline values, and so no coefficient estimates are provided for them: implicitly, their coefficient estimates are zero, and all other levels are measured relative to these baselines. For example, the Feb coefficient of 6.845 signifies that, holding all other variables constant, there are on average about 7 more riders in February than in January. Similarly there are about 16.5 more riders in March than in January.
A slightly different coding of the variables hr and mnth, is given as follows:
contrasts(Bikeshare$hr) = contr.sum(24)
contrasts(Bikeshare$mnth) = contr.sum(12)
mod.lm2 <- lm(bikers ~ mnth + hr + workingday + temp + weathersit ,data = Bikeshare)
summary(mod.lm2)
Call:
lm(formula = bikers ~ mnth + hr + workingday + temp + weathersit,
data = Bikeshare)
Residuals:
Min 1Q Median 3Q Max
-299.00 -45.70 -6.23 41.08 425.29
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 73.5974 5.1322 14.340 < 2e-16 ***
mnth1 -46.0871 4.0855 -11.281 < 2e-16 ***
mnth2 -39.2419 3.5391 -11.088 < 2e-16 ***
mnth3 -29.5357 3.1552 -9.361 < 2e-16 ***
mnth4 -4.6622 2.7406 -1.701 0.08895 .
mnth5 26.4700 2.8508 9.285 < 2e-16 ***
mnth6 21.7317 3.4651 6.272 3.75e-10 ***
mnth7 -0.7626 3.9084 -0.195 0.84530
mnth8 7.1560 3.5347 2.024 0.04295 *
mnth9 20.5912 3.0456 6.761 1.46e-11 ***
mnth10 29.7472 2.6995 11.019 < 2e-16 ***
mnth11 14.2229 2.8604 4.972 6.74e-07 ***
hr1 -96.1420 3.9554 -24.307 < 2e-16 ***
hr2 -110.7213 3.9662 -27.916 < 2e-16 ***
hr3 -117.7212 4.0165 -29.310 < 2e-16 ***
hr4 -127.2828 4.0808 -31.191 < 2e-16 ***
hr5 -133.0495 4.1168 -32.319 < 2e-16 ***
hr6 -120.2775 4.0370 -29.794 < 2e-16 ***
hr7 -75.5424 3.9916 -18.925 < 2e-16 ***
hr8 23.9511 3.9686 6.035 1.65e-09 ***
hr9 127.5199 3.9500 32.284 < 2e-16 ***
hr10 24.4399 3.9360 6.209 5.57e-10 ***
hr11 -12.3407 3.9361 -3.135 0.00172 **
hr12 9.2814 3.9447 2.353 0.01865 *
hr13 41.1417 3.9571 10.397 < 2e-16 ***
hr14 39.8939 3.9750 10.036 < 2e-16 ***
hr15 30.4940 3.9910 7.641 2.39e-14 ***
hr16 35.9445 3.9949 8.998 < 2e-16 ***
hr17 82.3786 3.9883 20.655 < 2e-16 ***
hr18 200.1249 3.9638 50.488 < 2e-16 ***
hr19 173.2989 3.9561 43.806 < 2e-16 ***
hr20 90.1138 3.9400 22.872 < 2e-16 ***
hr21 29.4071 3.9362 7.471 8.74e-14 ***
hr22 -8.5883 3.9332 -2.184 0.02902 *
hr23 -37.0194 3.9344 -9.409 < 2e-16 ***
workingday 1.2696 1.7845 0.711 0.47681
temp 157.2094 10.2612 15.321 < 2e-16 ***
weathersitcloudy/misty -12.8903 1.9643 -6.562 5.60e-11 ***
weathersitlight rain/snow -66.4944 2.9652 -22.425 < 2e-16 ***
weathersitheavy rain/snow -109.7446 76.6674 -1.431 0.15234
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 76.5 on 8605 degrees of freedom
Multiple R-squared: 0.6745, Adjusted R-squared: 0.6731
F-statistic: 457.3 on 39 and 8605 DF, p-value: < 2.2e-16
What is the difference between the two codings? In mod.lm2, a coefficient estimate is reported for all but the last level of hr and mnth. Importantly, in mod.lm2, the coefficient estimate for the last level of mnth is not zero: instead, it equals the negative of the sum of the coefficient estimates for all of the other levels. Similarly, in mod.lm2, the coefficient estimate for the last level of hr is the negative of the sum of the coefficient estimates for all of the other levels. This means that the coefficients of hr and mnth in mod.lm2 will always sum to zero, and can be interpreted as the difference from the mean level. For example, the coefficient for January of −46.087 indicates that, holding all other variables constant, there are typically 46 fewer riders in January relative to the yearly average.
It is important to realize that the choice of coding really does not matter, provided that we interpret the model output correctly in light of the coding used. For example, we see that the predictions from the linear model are the same regardless of coding:
sum(( predict(mod.lm) - predict(mod.lm2))^2)
[1] 1.586608e-18
The sum of squared differences is zero. We can also see this using the all.equal() function:
all.equal(predict(mod.lm), predict(mod.lm2))
[1] TRUE
The left panel of the figure below displays the coefficients connected with each month of the year from the least squares linear regression model, ‘mod.lm2’. The coefficients for January through November can be directly obtained from the ‘mod.lm2’ object. However, the coefficient for December must be calculated as the negative sum of all the other months. On the other hand, the right panel exhibits the coefficients related to the hour of the day. It indicates that bike usage is highest during peak commute times and lowest overnight. We first obtain the coefficient estimates associated with mnth and hr.
coef.months <- c(coef(mod.lm2)[2:12],-sum(coef(mod.lm2)[2:12]))
coef.hours <- c(coef(mod.lm2)[13:35] , -sum(coef(mod.lm2)[13:35]))
To make the plot, we manually label the x-axis with the names of the months.
par(mfrow = c(1, 2))
plot(coef.months , xlab = "Month", ylab = "Coefficient", xaxt = "n", col = "blue", pch = 19, type = "o")
axis(side = 1, at = 1:12, labels = c("J", "F", "M", "A", "M", "J", "J", "A", "S", "O", "N", "D"))
plot(coef.hours , xlab = "Hour", ylab = "Coefficient",col = "blue", pch = 19, type = "o")
Now, we consider instead fitting a Poisson regression model to the Bikeshare data. Very little changes, except that we now use the function glm() with the argument family = poisson to specify that we wish to fit a Poisson regression model:
mod.pois <- glm( bikers ~ mnth + hr + workingday + temp + weathersit ,data = Bikeshare , family = poisson)
summary(mod.pois)
Call:
glm(formula = bikers ~ mnth + hr + workingday + temp + weathersit,
family = poisson, data = Bikeshare)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.118245 0.006021 683.964 < 2e-16 ***
mnth1 -0.670170 0.005907 -113.445 < 2e-16 ***
mnth2 -0.444124 0.004860 -91.379 < 2e-16 ***
mnth3 -0.293733 0.004144 -70.886 < 2e-16 ***
mnth4 0.021523 0.003125 6.888 5.66e-12 ***
mnth5 0.240471 0.002916 82.462 < 2e-16 ***
mnth6 0.223235 0.003554 62.818 < 2e-16 ***
mnth7 0.103617 0.004125 25.121 < 2e-16 ***
mnth8 0.151171 0.003662 41.281 < 2e-16 ***
mnth9 0.233493 0.003102 75.281 < 2e-16 ***
mnth10 0.267573 0.002785 96.091 < 2e-16 ***
mnth11 0.150264 0.003180 47.248 < 2e-16 ***
hr1 -0.754386 0.007879 -95.744 < 2e-16 ***
hr2 -1.225979 0.009953 -123.173 < 2e-16 ***
hr3 -1.563147 0.011869 -131.702 < 2e-16 ***
hr4 -2.198304 0.016424 -133.846 < 2e-16 ***
hr5 -2.830484 0.022538 -125.586 < 2e-16 ***
hr6 -1.814657 0.013464 -134.775 < 2e-16 ***
hr7 -0.429888 0.006896 -62.341 < 2e-16 ***
hr8 0.575181 0.004406 130.544 < 2e-16 ***
hr9 1.076927 0.003563 302.220 < 2e-16 ***
hr10 0.581769 0.004286 135.727 < 2e-16 ***
hr11 0.336852 0.004720 71.372 < 2e-16 ***
hr12 0.494121 0.004392 112.494 < 2e-16 ***
hr13 0.679642 0.004069 167.040 < 2e-16 ***
hr14 0.673565 0.004089 164.722 < 2e-16 ***
hr15 0.624910 0.004178 149.570 < 2e-16 ***
hr16 0.653763 0.004132 158.205 < 2e-16 ***
hr17 0.874301 0.003784 231.040 < 2e-16 ***
hr18 1.294635 0.003254 397.848 < 2e-16 ***
hr19 1.212281 0.003321 365.084 < 2e-16 ***
hr20 0.914022 0.003700 247.065 < 2e-16 ***
hr21 0.616201 0.004191 147.045 < 2e-16 ***
hr22 0.364181 0.004659 78.173 < 2e-16 ***
hr23 0.117493 0.005225 22.488 < 2e-16 ***
workingday 0.014665 0.001955 7.502 6.27e-14 ***
temp 0.785292 0.011475 68.434 < 2e-16 ***
weathersitcloudy/misty -0.075231 0.002179 -34.528 < 2e-16 ***
weathersitlight rain/snow -0.575800 0.004058 -141.905 < 2e-16 ***
weathersitheavy rain/snow -0.926287 0.166782 -5.554 2.79e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 1052921 on 8644 degrees of freedom
Residual deviance: 228041 on 8605 degrees of freedom
AIC: 281159
Number of Fisher Scoring iterations: 5
We can plot the coefficients associated with mnth and hr
par(mfrow = c(1, 2))
coef.mnth <- c(coef(mod.pois)[2:12] ,-sum(coef(mod.pois)[2:12]))
plot(coef.mnth , xlab = "Month", ylab = "Coefficient",xaxt = "n", col = "blue", pch = 19, type = "o")
axis(side = 1, at = 1:12, labels = c("J", "F", "M", "A", "M","J", "J", "A", "S", "O", "N", "D"))
coef.hours <- c(coef(mod.pois)[13:35] ,-sum(coef(mod.pois)[13:35]))
plot(coef.hours , xlab = "Hour", ylab = "Coefficient",col = "blue", pch = 19, type = "o")
We can once again use the predict() function to obtain the fitted values (predictions) from this Poisson regression model. However, we must use the argument type = "response" to specify that we want R to output \(\exp \left(\hat{\beta}_0+\right.\) \(\left.\hat{\beta}_1 X_1+\ldots+\hat{\beta}_p X_p\right)\) rather than \(\hat{\beta}_0+\hat{\beta}_1 X_1+\ldots+\hat{\beta}_p X_p\), which it will output by default.
plot(predict(mod.lm2), predict(mod.pois , type = "response"))
abline (0, 1, col = 2, lwd = 3)
The predictions from the Poisson regression model are correlated with those from the linear model; however, the former are non-negative. As a result the Poisson regression predictions tend to be larger than those from the linear model for either very low or very high levels of ridership.
In this section, we used the glm() function with the argument family = poisson in order to perform Poisson regression. Earlier in this lab we used the glm() function with family = binomial to perform logistic regression. Other choices for the family argument can be used to fit other types of GLMs. For instance, family = Gamma fits a gamma regression model.
We explore the use of the validation set approach in order to estimate the test error rates that result from fitting various linear models on the Auto data set.
Before we begin, we use the set.seed() function in order to set a seed for seed R’s random number generator, so that the reader will obtain precisely the same results as those shown below. It is generally a good idea to set a random seed when performing an analysis such as cross-validation that contains an element of randomness, so that the results obtained can be reproduced precisely at a later time.
We begin by using the sample() function to split the set of observations
into two halves, by selecting a random subset of 196 observations out of the original 392 observations. We refer to these observations as the training set.
rm(Auto) #Remove Imported Auto data to use Auto Data from ISLR
library(ISLR)
set.seed(1)
train <- sample(392, 196)
(Here we use a shortcut in the sample command; see ?sample for details.) We then use the subset option in lm() to fit a linear regression using only the observations corresponding to the training set.
lm.fitA <- lm(mpg ~ horsepower, data = Auto, subset = train)
trainSET <- Auto[train, ]
validSET <- Auto[-train, ]
lm.fitB <- lm(mpg ~ horsepower, data = trainSET)
summary(lm.fitA)
Call:
lm(formula = mpg ~ horsepower, data = Auto, subset = train)
Residuals:
Min 1Q Median 3Q Max
-9.3177 -3.5428 -0.5591 2.3910 14.6836
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 41.283548 1.044352 39.53 <2e-16 ***
horsepower -0.169659 0.009556 -17.75 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.032 on 194 degrees of freedom
Multiple R-squared: 0.619, Adjusted R-squared: 0.6171
F-statistic: 315.2 on 1 and 194 DF, p-value: < 2.2e-16
summary(lm.fitB)
Call:
lm(formula = mpg ~ horsepower, data = trainSET)
Residuals:
Min 1Q Median 3Q Max
-9.3177 -3.5428 -0.5591 2.3910 14.6836
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 41.283548 1.044352 39.53 <2e-16 ***
horsepower -0.169659 0.009556 -17.75 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.032 on 194 degrees of freedom
Multiple R-squared: 0.619, Adjusted R-squared: 0.6171
F-statistic: 315.2 on 1 and 194 DF, p-value: < 2.2e-16
caret to accomplish the same thinglibrary(caret)
set.seed(3)
trainIndex <- createDataPartition(y = Auto$mpg,
p = 0.5,
list = FALSE,
times = 1)
training <- Auto[trainIndex, ]
testing <- Auto[-trainIndex, ]
dim(training)
[1] 198 9
dim(testing)
[1] 194 9
#
myControl1 <- trainControl(method = "none")
myControl2 <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5)
#
mod_lm <- train(mpg ~ horsepower,
data = training,
trControl = myControl1,
method = "lm")
#
mod_lm
Linear Regression
198 samples
1 predictor
No pre-processing
Resampling: None
summary(mod_lm)
Call:
lm(formula = .outcome ~ ., data = dat)
Residuals:
Min 1Q Median 3Q Max
-9.4938 -3.7061 -0.4894 2.9283 14.9663
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 40.06917 1.03089 38.87 <2e-16 ***
horsepower -0.15575 0.00909 -17.14 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.086 on 196 degrees of freedom
Multiple R-squared: 0.5997, Adjusted R-squared: 0.5976
F-statistic: 293.6 on 1 and 196 DF, p-value: < 2.2e-16
We now use the predict() function to estimate the response for all 392 observations, and we use the mean() function to calculate the MSE of the 196 observations in the validation set. Note that the -train index below selects only the observations that are not in the training set.
EMSE <- mean((Auto$mpg - predict(lm.fitA, newdata = Auto))[-train]^2)
EMSE
[1] 23.26601
# Or
EMSE <- mean((validSET$mpg - predict(lm.fitA, newdata = validSET))^2)
EMSE
[1] 23.26601
Use the RMSE() function from caret to compute the \(MSE_{testing}\).
# Using caret
RMSE(predict(mod_lm, testing), testing$mpg)^2
[1] 22.50887
Therefore, the estimated test MSE for the linear regression fit is 23.2660086. We can use the poly() function to estimate the test error for the polynomial and cubic regressions.
lm.fit2 <- lm(mpg ~ poly(horsepower, 2), data = Auto, subset = train)
EMSE2 <- mean((Auto$mpg - predict(lm.fit2, newdata = Auto))[-train]^2)
EMSE2
[1] 18.71646
# Or
EMSE2 <- mean((validSET$mpg - predict(lm.fit2, newdata = validSET))^2)
EMSE2
[1] 18.71646
Use the RMSE() function from caret to compute the \(MSE_{testing}\) for a second order polynomial model.
# Using caret
mod_lm2 <- train(mpg ~ poly(horsepower, 2),
data = training,
trControl = myControl1,
method = "lm")
RMSE(predict(mod_lm2, testing), testing$mpg)^2
[1] 19.592
lm.fit3 <- lm(mpg ~ poly(horsepower, 3), data = Auto, subset = train)
EMSE3 <- mean((Auto$mpg - predict(lm.fit3, newdata = Auto))[-train]^2)
EMSE3
[1] 18.79401
# Or
EMSE3 <- mean((validSET$mpg - predict(lm.fit3, newdata = validSET))^2)
EMSE3
[1] 18.79401
Use the RMSE() function from caret to compute the \(MSE_{testing}\) for a third order polynomial model.
# using caret---Your CODE HERE
mod_lm3 <- train(mpg ~ poly(horsepower, 3),
data = training,
trControl = myControl1,
method = "lm")
RMSE(predict(mod_lm3, testing), testing$mpg)^2
[1] 19.54432
These error rates are 18.7164595 and 18.7940068, respectively. If we choose a different training set instead, then we will obtain somewhat different errors on the validation set.
set.seed(231)
train <- sample(392, 196)
trainSET <- Auto[train, ]
validSET <- Auto[-train, ]
lm.fit1 <- lm(mpg ~ horsepower, data = Auto, subset = train)
EMSE1a <- mean((Auto$mpg - predict(lm.fit1, newdata = Auto))[-train]^2)
EMSE1a
[1] 25.14997
# Or
EMSE1a <- mean((validSET$mpg - predict(lm.fit1, newdata = validSET))^2)
EMSE1a
[1] 25.14997
lm.fit2 <- lm(mpg ~ poly(horsepower, 2), data = Auto, subset = train)
EMSE2a <- mean((Auto$mpg - predict(lm.fit2, newdata = Auto))[-train]^2)
EMSE2a
[1] 21.82773
# Or
EMSE2a <- mean((validSET$mpg - predict(lm.fit2, newdata = validSET))^2)
EMSE2a
[1] 21.82773
lm.fit3 <- lm(mpg ~ poly(horsepower, 3), data = Auto, subset = train)
EMSE3a <- mean((Auto$mpg - predict(lm.fit3, newdata = Auto))[-train]^2)
EMSE3a
[1] 21.91503
# Or
EMSE3a <- mean((validSET$mpg - predict(lm.fit3, newdata = validSET))^2)
EMSE3a
[1] 21.91503
Using this split of the observations into a training set and a validation set, we find that the validation set error rates for the models with linear, quadratic, and cubic terms are 25.1499689, 21.8277328, and 21.9150317, respectively.
These results are consistent with our previous findings: a model that predicts mpg using a quadratic function of horsepower performs better than a model that involves only a linear function of horsepower, and there is little evidence in favor of a model that uses a cubic function of horsepower.
The LOOCV estimate can be automatically computed for any generalized linear model using the glm() and cv.glm() functions. If we use glm() to fit a model without passing in the family argument, then it performs linear regression, just like the lm() function. So for instance,
glm.fit <- glm(mpg ~ horsepower, data = Auto)
coef(glm.fit)
(Intercept) horsepower
39.9358610 -0.1578447
and
lm.fit <- lm(mpg ~ horsepower, data = Auto)
coef(lm.fit)
(Intercept) horsepower
39.9358610 -0.1578447
yield identical linear regression models. In this lab, we will perform linear regression using the glm() function rather than the lm() function because the latter can be used together with cv.glm(). The cv.glm() function is part of the boot package.
library(boot)
glm.fit <- glm(mpg ~ horsepower, data = Auto)
cv.err = cv.glm(data = Auto, glmfit = glm.fit)
cv.err$delta[1]
[1] 24.23151
Use the method = "LOOCV" from caret to estimate the leave one out cross validation estimate of the test error for a model that regresses mpg onto horsepower.
# Using caret
myControl3 <- trainControl(method = "LOOCV")
mod_CV <- train(mpg ~ horsepower,
data = Auto,
trControl = myControl3,
method = "lm")
mod_CV$results
intercept RMSE Rsquared MAE
1 TRUE 4.922552 0.6012358 3.848748
mod_CV$results$RMSE^2
[1] 24.23151
The cv.glm() function produces a list with several components. The two numbers in the delta vector contain the cross-validation results. In this cv.glm() case the numbers are identical (up to two decimal places) and correspond to the LOOCV statistic. Below, we discuss a situation in which the two numbers differ. Our cross-validation estimate for the test error is approximately 24.2315135.
We can repeat this procedure for increasingly complex polynomial fits. To automate the process, we use the for() function to initiate a for loop which iteratively fits polynomial regressions for polynomials of order i = 1 to i = 5, computes the associated cross-validation error, and stores it in the i\(^{\text{th}}\) element of the vector cv.error. We begin by initializing the vector. This command will likely take a couple of minutes to run.
cv.error <- numeric(5)
for(i in 1:5){
glm.fit <- glm(mpg ~ poly(horsepower, i), data = Auto)
cv.error[i] <- cv.glm(data = Auto, glmfit = glm.fit)$delta[1]
}
cv.error
[1] 24.23151 19.24821 19.33498 19.42443 19.03321
Use the method = "LOOCV" from caret to estimate the leave one out cross validation estimate of the test error for increasingly complex polynomial fits of order i = 1 to i = 5 that regress mpg onto horsepower.
# Using caret
degree <- 1:5
CV <- numeric(5)
for(d in degree){
f <- bquote(mpg ~ poly(horsepower, .(d)))
mod_CV <- train(as.formula(f),
data = Auto,
trControl = myControl3,
method = "lm")
CV[d] <- mod_CV$results$RMSE^2
}
CV
[1] 24.23151 19.24821 19.33498 19.42443 19.03321
The cv.glm() function can also be used to implement \(k\)-fold CV. Below we use k = 10, a common choice for \(k\), on the Auto data set. We once again set a random seed and initialize a vector in which we will store the CV errors corresponding to the polynomial fits of orders one to ten.
set.seed(17)
cv.error.10 <- numeric(10)
for(i in 1:10){
glm.fit <- glm(mpg ~ poly(horsepower, i), data = Auto)
cv.error.10[i] <- cv.glm(data = Auto, glmfit = glm.fit, K = 10)$delta[1]
}
cv.error.10
[1] 24.27207 19.26909 19.34805 19.29496 19.03198 18.89781 19.12061 19.14666
[9] 18.87013 20.95520
Use the method = "cv" with number = 10 for K= 10 folds from caret to estimate the 10 fold cross validation estimate of the test error for increasingly complex polynomial fits of order i = 1 to i = 10 that regress mpg onto horsepower.
# using caret---Your CODE HERE
set.seed(172)
myControl4 <- trainControl(method = "cv",
number = 10)
degree <- 1:10
CV <- numeric(10)
for(d in degree){
f <- bquote(mpg ~ poly(horsepower, .(d)))
mod_CV <- train(as.formula(f),
data = Auto,
trControl = myControl4,
method = "lm")
CV[d] <- mod_CV$results$RMSE^2
}
CV
[1] 23.91805 18.80915 19.23979 18.78304 18.87683 18.56381 18.64721 18.71842
[9] 18.76923 18.83391
Notice that the computation time is much shorter than that of LOOCV. (In principle, the computation time for LOOCV for a least squares linear model should be faster than for \(k\)-fold CV, due to the availability of \[CV_{(n)} = \frac{1}{n}\sum_{i = 1}^n\left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^2,\] for LOOCV; however, unfortunately the cv.glm() function does not make use of this formula.) We still see little evidence that using cubic or higher-order polynomial terms leads to lower test error than simply using a quadratic fit.
We saw in the last section that the two numbers associated with delta are essentially the same when LOOCV is performed. When we instead perform \(k\)-fold CV, then the two numbers associated with delta differ slightly. The first is the standard \(k\)-fold CV estimate, as in \[CV_{(k)} = \frac{1}{k}\sum_{i=1}^kMSE_i.\] The second is a bias-corrected version. On this data set, the two estimates are very similar to each other.
Compute 10-fold Cross Validation errors for mod.be, mod.fs, mod1, mod2, and mod3 from Project 1.
site <- "http://ww2.amstat.org/publications/jse/datasets/homes76.dat.txt"
HP <- read.table(file = site, header = TRUE)
# Removing indicated columns
HP <- HP[ ,-c(1, 7, 10, 15, 16, 17, 18, 19)]
# Renaming columns
names(HP) <- c("price", "size", "lot", "bath", "bed", "year", "age",
"garage", "status", "active", "elem")
mod.be <- glm(price ~ size + lot + bed + status + elem, data = HP)
mod.fs <- glm(price ~ elem + garage + lot + active + size + bed, data = HP)
mod1 <- glm(price ~ . - status - year, data = HP)
mod2 <- update(mod1, .~. + bath:bed + I(age^2))
mod3 <- glm(price ~ size + lot + bath + bed + bath:bed + age + I(age^2) +
garage + active + I(elem == 'edison') + I(elem == 'harris'),
data = HP)
library(boot)
set.seed(461)
CV10mod.be <- cv.glm(data = HP, glmfit = mod.be, K = 10)$delta[1]
CV10mod.be
[1] 2372.26
CV10mod.fs <- cv.glm(data = HP, glmfit = mod.fs, K = 10)$delta[1]
CV10mod.fs
[1] 2416.15
CV10mod1 <- cv.glm(data = HP, glmfit = mod1, K = 10)$delta[1]
CV10mod1
[1] 2844.668
CV10mod2 <- cv.glm(data = HP, glmfit = mod2, K = 10)$delta[1]
CV10mod2
[1] 2639.653
CV10mod3 <- cv.glm(data = HP, glmfit = mod3, K = 10)$delta[1]
CV10mod3
[1] 2297.885
Use the method = cv" with number = 10 from caret to estimate the ten fold cross validation estimate of the test error for the five models from Project 1.
# Using caret
set.seed(461)
myControl5 <- trainControl(method = "cv",
number = 10)
mod_BE <- train(price ~ size + lot + bed + status + elem,
data = HP,
trControl =myControl5,
method = "lm")
mod_BE$results$RMSE^2
[1] 2142.238
### Below is what the others should have
mod_FS <- train(price ~ elem + garage + lot + active + size + bed,
data = HP,
trControl =myControl5,
method = "lm")
mod_FS$results$RMSE^2
[1] 2050.674
mod_MOD1 <- train(price ~ . - status - year,
data = HP,
trControl =myControl5,
method = "lm")
mod_MOD1$results$RMSE^2
[1] 2492.183
mod_MOD2 <- train(price ~ . - status - year + bath:bed + I(age^2),
data = HP,
trControl =myControl5,
method = "lm")
mod_MOD2$results$RMSE^2
[1] 2480.545
mod_MOD3 <- train(price ~ size + lot + bath + bed + bath:bed + age + I(age^2) +
garage + active + I(elem == 'edison') + I(elem == 'harris'),
data = HP,
trControl =myControl5,
method = "lm")
mod_MOD3$results$RMSE^2
[1] 2164.437
Complete the table below using inline R code to report your answers.
Table of Results
| Model | MSE from cv.glm() |
MSE from caret |
|---|---|---|
| Backward elimination | 2372.26 | 2142.24 |
| Forward selection | 2416.15 | 2050.67 |
Model 1 (mod1) |
2844.67 | 2492.18 |
Model 2 (mod2) |
2639.65 | 2480.55 |
Model 3 (mod3) |
2297.88 | 2164.44 |
This document uses DT by Xie, Cheng, and Tan (2024), ggplot2 by Wickham et al. (2024), ISLR by James et al. (2021), ISLR2 by James et al. (2022),,MASS by Ripley (2024), plotly by Sievert et al. (2024), rmarkdown by Allaire et al. (2024), dplyr by Wickham et al. (2023), knitr by Xie (2024b), class by Ripley (2023), and bookdown by Xie (2024a).
sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: x86_64-apple-darwin20
Running under: macOS Sonoma 14.6.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bookdown_0.40 rmarkdown_2.27 boot_1.3-30 caret_6.0-94 lattice_0.22-6
[6] class_7.3-22 ISLR2_1.3-2 broom_1.0.6 car_3.1-2 carData_3.0-5
[11] ISLR_1.4 MASS_7.3-60.2 dplyr_1.1.4 plotly_4.10.4 ggvis_0.4.9
[16] ggplot2_3.5.1 DT_0.33 knitr_1.48
loaded via a namespace (and not attached):
[1] pROC_1.18.5 rlang_1.1.4 magrittr_2.0.3
[4] compiler_4.4.0 mgcv_1.9-1 vctrs_0.6.5
[7] reshape2_1.4.4 stringr_1.5.1 pkgconfig_2.0.3
[10] fastmap_1.2.0 backports_1.5.0 labeling_0.4.3
[13] utf8_1.2.4 promises_1.3.0 prodlim_2024.06.25
[16] purrr_1.0.2 xfun_0.45 cachem_1.1.0
[19] jsonlite_1.8.8 recipes_1.1.0 highr_0.11
[22] later_1.3.2 jpeg_0.1-10 parallel_4.4.0
[25] R6_2.5.1 bslib_0.7.0 stringi_1.8.4
[28] parallelly_1.37.1 rpart_4.1.23 lubridate_1.9.3
[31] jquerylib_0.1.4 Rcpp_1.0.12 assertthat_0.2.1
[34] iterators_1.0.14 future.apply_1.11.2 httpuv_1.6.15
[37] Matrix_1.7-0 splines_4.4.0 nnet_7.3-19
[40] timechange_0.3.0 tidyselect_1.2.1 rstudioapi_0.16.0
[43] abind_1.4-5 yaml_2.3.8 timeDate_4032.109
[46] codetools_0.2-20 listenv_0.9.1 tibble_3.2.1
[49] plyr_1.8.9 shiny_1.9.1 withr_3.0.0
[52] evaluate_0.24.0 future_1.33.2 survival_3.5-8
[55] pillar_1.9.0 foreach_1.5.2 stats4_4.4.0
[58] generics_0.1.3 munsell_0.5.1 scales_1.3.0
[61] globals_0.16.3 xtable_1.8-4 glue_1.7.0
[64] lazyeval_0.2.2 tools_4.4.0 data.table_1.15.4
[67] ModelMetrics_1.2.2.2 gower_1.0.1 grid_4.4.0
[70] tidyr_1.3.1 crosstalk_1.2.1 ipred_0.9-14
[73] colorspace_2.1-0 nlme_3.1-164 cli_3.6.2
[76] fansi_1.0.6 viridisLite_0.4.2 lava_1.8.0
[79] gtable_0.3.5 sass_0.4.9 digest_0.6.35
[82] htmlwidgets_1.6.4 farver_2.1.2 htmltools_0.5.8.1
[85] lifecycle_1.0.4 hardhat_1.4.0 httr_1.4.7
[88] mime_0.12
The material in Section 1 is modified from the Chapter 2 lab of An introduction to Statistical Learning.↩︎
Use the argument dpi = 96 inside include_graphics().↩︎